Here I want to explore some aspects of the first batch of fastq files received using the K-mer Analysis Toolkit (Mapleson et al. 2016). K-mer counts offer a signature of the sequence compostion of a fastq file. The distribution of counts is related to how repetitive the content is. Many k-mers (words) counted few times indicate that most reads are quite different among them, as if coverage was very low or if they had a lot of random errors.
We can also compare the counts between two fastq files: the more similar the sequences in the two files, the more correlated the counts will be. With this, I can take a look at how much more similar missassigned reads are to those from their putative original sample than to those from any other sample. I should also be able to confirm that samples from the same species have more similar sequence compositions than unrelated fastq files.
Finally, I noticed that fastq files have reads from two lanes (indicated in the fourth “:”-separated field of the name of the read), and we have been told that lane 2 delivered low quality data. If I manage to distribute a fastq file in two, for the two lanes, I could also compare them.
K-mers can be counted on the sequenced strand or on both. Reads are oriented, because specific adapters were ligated to either enzyme’s cut site. That is, reads from a locus will always be sequenced from the same side. Thus, I don’t need to count canonical k-mers, but only in one strand.
I choose a few fastq files to analyse, just as examples.
CHOSEN=(LAN039 BRZ057 THU185 WAL078 175619 HOP006
HOP049 HOP026 HOP034 HOP052 HOP066 HOP038)
for sample in ${CHOSEN[@]}; do
if [ ! -d $sample ]; then mkdir $sample; fi
for fastq in $(ls ../../data/fastq/*.fastq.gz | grep $sample); do
READ=$(echo $fastq | cut -d '_' -f 4)
if [ ! -e $sample/$READ.hist ]; then
kat hist -t 2 -o -N $sample/$READ $fastq &> $sample/$READ.log &
fi
done
done
wait
if [ ! -e byRead ]; then
gawk '(FNR == 1){
split(FILENAME, A, /\//)
SAMPLE = A[2]
READ = substr(A[3], 2, 1)
ALLSAMPLES[SAMPLE] = 1
}(/^[^#]/){
F[SAMPLE, READ, $1] = $2
}END{
for (s in ALLSAMPLES) {
for (i = 1; i <= 2; i++) {
for (j = 1; j <= 2000; j++) {
print s "\t" i "\t" j "\t" F[s, i, j]
}
}
}
}' $(find . -name 'R[12]') | grep -v HOP > byRead.hist
fi
library('ggplot2')
byRead <- read.table('byRead.hist', col.names = c('Sample','Read','kmerFreq','kmerNum'),
colClasses = c('factor','factor','numeric','numeric'))
ggplot(data = byRead, mapping = aes(x = kmerFreq, y = kmerNum, color = Read)) +
geom_line() + xlim(1, 500) + scale_y_log10() + facet_wrap(~Sample)
## Warning: Removed 3000 row(s) containing missing values (geom_path).
Actaully, the mock samples do not have enough reads for a meaningful k-mer profile. In the rest, we can see the distribution of k-mer frequencies does resemble a mixture of an exponential (sequencing errors and possible contamination) and a normal that could be centered around 100. The problem is that the excess of low frequency k-mers makes it difficult to see the normal component. Changing the k up to 51 does not improve the profile.
Importantly, forward and reverse reads have very similar profiles.
The challenge is to split fastq files by lane. I can use gawk. In ordre to avoid writing and reading from additional fastq files, I wanted to make kat hist read from standard input, but it does not work. I raised an issue in KAT’s github. For the moment, I will have to use files.
CHOSEN=(LAN039 BRZ057 THU185 WAL078 175619)
for sample in ${CHOSEN[@]}; do
if [ ! -e $sample/lane1.hist ]; then
if [ ! -e $sample/lane1.fasta ]; then
gunzip -c ../../data/fastq/$sample*.fastq.gz | \
gawk -v SAMPLE=$sample '(NR % 4 == 1){
if (NR > 1) {
split(NAME,A,/:/)
printf ">%s\n%s\n", NAME, SEQ >> sprintf("%s/lane%i.fasta", SAMPLE, A[4])
}
NAME = $1
}(NR % 4 == 2){
SEQ = $1
}END{
split(NAME, A, /:/)
printf "%s\n%s\n", NAME, SEQ >> sprintf("%s/lane%i.fasta", SAMPLE, A[4])
}' &
fi
fi
done
wait
for sample in ${CHOSEN[@]}; do
if [ ! -e $sample/lane1.hist ]; then
kat hist -N -t 4 -h 2000 -o $sample/lane1.hist $sample/lane1.fasta &> $sample/lane1.log &
kat hist -N -t 4 -h 2000 -o $sample/lane2.hist $sample/lane2.fasta &> $sample/lane2.log &
fi
done
wait
if [ ! -e byLane.hist ]; then
gawk '(FNR == 1){
split(FILENAME, A, /\//)
SAMPLE = A[2]
LANE = substr(A[3], 5, 1)
ALLSAMPLES[SAMPLE] = 1
}(/^[^#]/){
F[SAMPLE, LANE, $1] = $2
}END{
for (s in ALLSAMPLES) {
for (i = 1; i <= 2; i++) {
for (j = 1; j <= 2000; j++) {
print s "\t" i "\t" j "\t" F[s, i, j]
}
}
}
}' $(find . -name 'lane*.hist') > byLane.hist
fi
byLane <- read.table('byLane.hist', col.names=c('Sample','Lane','kmerFreq','kmerNum'),
colClasses = c('factor','factor','numeric','numeric'))
ggplot(data = byLane, mapping = aes(x = kmerFreq, y = kmerNum, color = Lane)) +
geom_line() + scale_y_log10() + xlim(1, 500) + facet_wrap(~Sample)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 3000 row(s) containing missing values (geom_path).
Lanes 1 and 2 have almost identical k-mer profiles in all 5 samples analysed here. Filtering low quality reads could improve the profile, if sequencing errors are a big portion of the low multiplicity peak in the k-mer profile. But contamination and sequencing of fragments from out of the selected range can also contribute that that peak.
CHOSEN=(LAN039 BRZ057 THU185 WAL078 175619)
for sample in ${CHOSEN[@]}; do
if [ ! -e $sample/gcp.mx ]; then
kat gcp -N -t 4 -o $sample/gcp ../../data/fastq/$sample*.fastq.gz &> $sample/gcp.log &
fi
done
wait
library('tidyr')
library('gridExtra')
for (Sample in c('LAN039', 'BRZ057', 'THU185', 'WAL078', '175619')) {
gcp <- read.table(paste(Sample, 'gcp.mx', sep='/'))
gcp$GC <- seq(0,26)
gcp_long <- pivot_longer(gcp, 1:1001, names_to = 'kmerFreq', values_to = 'kmerNum')
gcp_long$kmerFreq <- as.numeric(substring(gcp_long$kmerFreq, 2)) - 1
assign(sprintf('p%s', Sample), ggplot(data=gcp_long, mapping=aes(x=kmerFreq, y=GC, fill=kmerNum)) +
scale_fill_gradientn(trans='log', colors=terrain.colors(10), breaks=c(1,100,10000,1e6), na.value='blue') +
geom_tile() + ggtitle(Sample))
}
grid.arrange(pLAN039, pBRZ057, pTHU185, pWAL078, p175619, nrow=3)
SAMPLE=(LAN039 BRZ057 THU185 WAL078 175619)
for first in 0 1 2 3; do
for second in $(seq $(($first + 1)) 4); do
if [ ! -e ${SAMPLE[$first]}/comp_${SAMPLE[$second]}-main.mx ]; then
FASTQ1=$(ls -1 ../../data/fastq/${SAMPLE[$first]}* | grep R1)
FASTQ2=$(ls -1 ../../data/fastq/${SAMPLE[$second]}* | grep R1)
if [ -z "${SIZE[$second]}" ]; then
SIZE[$second]=$(gunzip -c $FASTQ2 | wc -l)
fi
if [ -z "${SIZE[$first]}" ]; then
SIZE[$first]=$(gunzip -c $FASTQ1 | wc -l)
fi
if [ ${SIZE[$first]} -ge ${SIZE[$second]} ]; then
PARAM='--d2_scale'
VALUE=$(echo "${SIZE[$second]} / ${SIZE[$first]}" | bc -l)
else
PARAM='--d1_scale'
VALUE=$(echo "${SIZE[$first]} / ${SIZE[$second]}" | bc -l)
fi
kat comp --output_prefix ${SAMPLE[$first]}/comp_${SAMPLE[$second]} \
--non_canonical_1 \
--non_canonical_2 \
--threads 8 \
$PARAM $VALUE \
--density_plot \
$FASTQ1 $FASTQ2 \
&> ${SAMPLE[$first]}/comp_${SAMPLE[$second]}.log &
fi
done
done
wait
SAMPLE = c('LAN039', 'BRZ057', 'THU185', 'WAL078', '175619')
for (i in 1:4) {
for (j in (i+1):5) {
comp <- read.table(sprintf('%s/comp_%s-main.mx', SAMPLE[i], SAMPLE[j]))
comp$Y <- seq(0,1000)
comp_long <- pivot_longer(comp, 1:1001, names_to = 'X', values_to = 'frequency')
comp_long$X <- as.numeric(substring(comp_long$X, 2)) - 1
assign(sprintf('p%s_%s', SAMPLE[i], SAMPLE[j]),
ggplot(data=comp_long, mapping=aes(x=X, y=Y, fill=frequency)) + geom_tile() + xlim(0,500) + ylim(0,500) +
scale_fill_gradientn(trans='log', colors=terrain.colors(10), breaks=c(1,100,10000,1000000), na.value='blue') +
xlab(sprintf('27-mer frequency in %s', SAMPLE[i])) +
ylab(sprintf('27-mer frequency in %s', SAMPLE[j])))
}
}
layout_matrix <- matrix(c(1, NA, NA, NA,
2, 3, NA, NA,
4, 5, 6, NA,
7, 8, 9, 10), nrow=4, ncol=4, byrow=TRUE)
grid.arrange(pLAN039_BRZ057,
pLAN039_THU185, pBRZ057_THU185,
pLAN039_WAL078, pBRZ057_WAL078, pTHU185_WAL078,
pLAN039_175619, pBRZ057_175619, pTHU185_175619, pWAL078_175619,
layout_matrix = layout_matrix)
Mapleson, Daniel, Gonzalo Garcia Accinelli, George Kettleborough, Jonathan Wright, and Bernardo J Clavijo. 2016. “KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies.” Bioinformatics 33 (4): 574–76. https://doi.org/10.1093/bioinformatics/btw663.